Vectorize Enumerable.Range(...).ToArray() by neon-sunset · Pull Request #80633 · dotnet/runtime

neon-sunset · 2023-01-13T22:30:56Z

Loop body is almost the same to what LLVM produces for Rust's let vec: Vec<i32> = (0..count).collect() with x86-64-v3 target except the unroll factor it chooses is 4 while here we pick 2 instead because it saturates store throughput on Zen 3 anyways and has much smaller codegen size.

NB: on ARM64 it emits a pair of strs vs LLVM's single stp but I'm not sure if that's an optimization .NET can do today out of two adjacent Unsafe.WriteUnaligned.

Codegen (.NET 8 daily build + FullPGO):

; BenchFixture.Enumerable2+RangeIterator.InitializeSpanCore(System.Span`1<Int32>)
       vzeroupper
       mov       rax,[rdx]
       mov       edx,[rdx+8]
       mov       r8d,edx
       sar       r8d,1F
       and       r8d,0F
       add       r8d,edx
       and       r8d,0FFFFFFF0
       mov       r9d,edx
       sub       r9d,r8d
       vmovupd   ymm0,[7FFDC3F91140]
       vpbroadcastd ymm1,dword ptr [rcx+8]
       mov       r8,1457FC040C8
       mov       r8,[r8]
       vpaddd    ymm1,ymm1,[r8+8]
       vpaddd    ymm2,ymm1,[7FFDC3F91160]
       mov       r8d,edx
       sub       r8d,r9d
       movsxd    r8,r8d
       lea       r8,[rax+r8*4]
       cmp       rax,r8
       je        short M03_L01
       nop       dword ptr [rax]
M03_L00:
       vmovupd   [rax],ymm1
       vmovupd   [rax+20],ymm2
       vpaddd    ymm1,ymm1,ymm0
       vpaddd    ymm2,ymm2,ymm0
       add       rax,40
       cmp       rax,r8
       jne       short M03_L00
M03_L01:
       sub       edx,r9d
       mov       r8d,[rcx+8]
       add       edx,r8d
       add       r8d,[rcx+0C]
       cmp       edx,r8d
       jge       short M03_L03
M03_L02:
       mov       [rax],edx
       add       rax,4
       inc       edx
       cmp       edx,r8d
       jl        short M03_L02
M03_L03:
       vzeroupper
       ret
; Total bytes of code 158

This is a port of https://github.com/neon-sunset/RangeExtensions/blob/main/RangeExtensions/RangeEnumerable.ICollection.cs#L144 except it doesn't need to support bi-directional scenario.

ghost · 2023-01-13T22:31:10Z

Tagging subscribers to this area: @dotnet/area-system-linq
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author:	neon-sunset
Assignees:	-
Labels:	`area-System.Linq`, `community-contribution`
Milestone:	-

neon-sunset · 2023-01-13T22:54:31Z

Numbers (there is sadly a lot of variability on my system so please consider median too):

BenchmarkDotNet=v0.13.3, OS=Windows 11 (10.0.22621.963)
AMD Ryzen 7 5800X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=8.0.100-alpha.1.23062.10
  [Host]     : .NET 8.0.0 (8.0.23.5802), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.5802), X64 RyuJIT AVX2


|     Method | Value |         Mean |      Error |     StdDev |       Median | Ratio | RatioSD | Code Size |
|----------- |------ |-------------:|-----------:|-----------:|-------------:|------:|--------:|----------:|
| ToArrayOld |    10 |     7.160 ns |  0.2435 ns |  0.6748 ns |     7.395 ns |  1.00 |    0.00 |      84 B |
| ToArrayNew |    10 |     9.170 ns |  0.4456 ns |  1.3140 ns |     8.591 ns |  1.29 |    0.21 |     220 B |
|            |       |              |            |            |              |       |         |           |
| ToArrayOld |   100 |    38.979 ns |  0.8190 ns |  2.0395 ns |    38.957 ns |  1.00 |    0.00 |      84 B |
| ToArrayNew |   100 |    27.410 ns |  0.5920 ns |  1.5802 ns |    27.395 ns |  0.71 |    0.06 |     363 B |
|            |       |              |            |            |              |       |         |           |
| ToArrayOld | 10000 | 3,298.214 ns | 60.8086 ns | 81.1778 ns | 3,276.112 ns |  1.00 |    0.00 |      84 B |
| ToArrayNew | 10000 |   644.097 ns | 12.3432 ns | 26.8330 ns |   640.007 ns |  0.20 |    0.01 |     390 B |

(data is for direct ToArray cost without interface calls and RangeIterator allocation)

stephentoub · 2023-01-13T22:57:54Z

Thanks, @neon-sunset, for working on this. I've avoided doing this kind of vectorization in the past because I couldn't find any meaningful use outside of tests, and the impact on maintainability / size wasn't worth it. Can you share any information on where you see this optimization benefiting consumers? Any examples you can point to? Thanks!

neon-sunset · 2023-01-13T23:17:53Z

Can you share any information on where you see this optimization benefiting consumers?

Sorry, I cannot. The initial implementation was done simply because .NET numbers weren't competitive with Rust or C++ and it was fun to iterate with BDN assembly output to achieve parity.

As far as usages in dotnet/runtime go, I do not see an easy way to locate all occurrences where an array is manually initialized to a range of values while most usages of Enumerable.Range itself seem to be distributed in tests just like you mentioned.

Also, List initialization can still be improved even without vectorization through providing ICollection<T> stub or allowing List<T> constructor to resolve IListProvider<T>.

xtqqczze · 2023-01-16T01:02:33Z

https://grep.app/search?q=Enumerable%5C.Range%5C%28%5B%5E%29%5D%2B%5C%29%5C.ToArray&regexp=true&case=true

stephentoub · 2023-01-16T14:18:15Z

https://grep.app/search?q=Enumerable%5C.Range%5C%28%5B%5E%29%5D%2B%5C%29%5C.ToArray&regexp=true&case=true

Yes. Now which of those aren't:

In a test, and/or
In a sample, and/or
So small a range that vectorization doesn't help or even hurt, and/or
On a use-once or only a few times code path

?

neon-sunset · 2023-01-17T09:34:10Z

Pushed a fix for impl. and feedback. Will close the PR since a change is deemed inappropriate for System.LINQ after the tests pass so that someone else can pick it up if it is ever reconsidered in the future.

Will also open a separate issue because materialization to List<T> can be sped up quite a bit by resolving IListProvider<T> and initializing an array used to construct the list directly.

stephentoub · 2023-01-17T11:05:22Z

since a change is deemed inappropriate

To be clear, I'd be happy if we could demonstrate this was valuable. I just haven't seen enough evidence of that yet.

neon-sunset · 2023-01-17T23:39:30Z

There are a few interesting places that initialize arrays directly:

and last but not least https://grep.app/search?q=%5Bi%5D%20%3D%20i%3B&case=true&filter[lang][0]=C%23&filter[repo][0]=mathnet/mathnet-numerics&filter[path][0]=src/Numerics/

Still, Enumerable.Range itself is mostly used in tests for initializing short arrays. I agree with your initial comment that merging the PR is of questionable profitability - most perf-sensitive locations that could benefit from such change would be then hurt by the cost of virtual calls and RangeIterator allocation or may just outright not use System.Linq because it is usually a performance concern, not an improvement.

If you don't think it's worth keeping around or moving the implementation to some internal helper method, feel free to close. Thanks!

stephentoub · 2023-01-20T15:47:33Z

You shouldn't need unsafe code here to avoid bounds checks.

int cur = _start; for (int i = 0; i < destination.Length; i++) { destination[i] = cur; cur++; }

stephentoub · 2023-01-20T15:48:12Z

This won't be inlined. I'd expect this to regress for very short ranges due to the extra method call that wasn't there before.

dotnet#80633 (comment)

* Remove `RangeIterator.ToArray` bounds check #80633 (comment) * Eliminate additional `ToArray` bounds checks

* Remove `RangeIterator.ToArray` bounds check dotnet#80633 (comment) * Eliminate additional `ToArray` bounds checks

xtqqczze · 2023-01-24T15:22:30Z

@neon-sunset Could you report new BenchmarkDotNet results after #81001?

jeffhandley · 2023-02-14T02:14:19Z

Hi @neon-sunset. Thanks for getting this contribution going. There are a couple of feedback comments that still need to be addressed, and there's a conflict that has emerged too. I'm going to mark this as needs-author-action to await your next iteration. Let us know if you need assistance moving it forward at this point though.

neon-sunset · 2023-02-15T05:58:56Z

Hi @jeffhandley Thanks, I will address feedback later this week.

neon-sunset · 2023-02-24T12:07:40Z

Failures seem to be unrelated, I think?
Replacing GC.AllocateUninitializedArray with simple new int[] fixes regression on small inputs. In fact, it does not seem to be profitable at all. AVX2 improvement should be greater, but I cannot test this because my other device has died.

Below numbers are for direct ToArray() calls. On short lengths, the cost of interface calls and iterator allocation dwarfs everything else :(

BenchmarkDotNet=v0.13.3, OS=macOS 13.2.1 (22D68) [Darwin 22.3.0]
Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores
.NET SDK=8.0.100-preview.2.23121.17
  [Host]        : .NET 8.0.0 (8.0.23.11601), Arm64 RyuJIT AdvSIMD
  DefaultJob    : .NET 8.0.0 (8.0.23.11601), Arm64 RyuJIT AdvSIMD
  NativeAOT 8.0 : .NET 8.0.0-preview.2.23116.1, Arm64 NativeAOT AdvSIMD

|        Method |           Job |       Runtime |  Length |          Mean |       Error |      StdDev | Ratio |
|-------------- |-------------- |-------------- |-------- |--------------:|------------:|------------:|------:|
|       ToArray |    DefaultJob |      .NET 8.0 |       1 |      3.025 ns |   0.0167 ns |   0.0156 ns |  1.00 |
|     ToArrayPR |    DefaultJob |      .NET 8.0 |       1 |      3.066 ns |   0.0082 ns |   0.0069 ns |  1.01 |
|               |               |               |         |               |             |             |       |
|       ToArray | NativeAOT 8.0 | NativeAOT 8.0 |       1 |      3.294 ns |   0.0119 ns |   0.0111 ns |  1.00 |
|     ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 |       1 |      3.504 ns |   0.0036 ns |   0.0028 ns |  1.07 |
|               |               |               |         |               |             |             |       |
|       ToArray |    DefaultJob |      .NET 8.0 |       5 |      5.767 ns |   0.0210 ns |   0.0196 ns |  1.00 |
|     ToArrayPR |    DefaultJob |      .NET 8.0 |       5 |      5.991 ns |   0.0241 ns |   0.0226 ns |  1.04 |
|               |               |               |         |               |             |             |       |
|       ToArray | NativeAOT 8.0 | NativeAOT 8.0 |       5 |      5.707 ns |   0.0130 ns |   0.0121 ns |  1.00 |
|     ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 |       5 |      5.475 ns |   0.0175 ns |   0.0155 ns |  0.96 |
|               |               |               |         |               |             |             |       |
|       ToArray |    DefaultJob |      .NET 8.0 |      10 |      7.837 ns |   0.0246 ns |   0.0230 ns |  1.00 |
|     ToArrayPR |    DefaultJob |      .NET 8.0 |      10 |      5.552 ns |   0.0199 ns |   0.0176 ns |  0.71 |
|               |               |               |         |               |             |             |       |
|       ToArray | NativeAOT 8.0 | NativeAOT 8.0 |      10 |      9.208 ns |   0.0270 ns |   0.0253 ns |  1.00 |
|     ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 |      10 |      5.989 ns |   0.0116 ns |   0.0108 ns |  0.65 |
|               |               |               |         |               |             |             |       |
|       ToArray |    DefaultJob |      .NET 8.0 |     100 |     50.587 ns |   0.0668 ns |   0.0592 ns |  1.00 |
|     ToArrayPR |    DefaultJob |      .NET 8.0 |     100 |     23.730 ns |   0.0718 ns |   0.0600 ns |  0.47 |
|               |               |               |         |               |             |             |       |
|       ToArray | NativeAOT 8.0 | NativeAOT 8.0 |     100 |     45.158 ns |   0.1053 ns |   0.0985 ns |  1.00 |
|     ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 |     100 |     21.474 ns |   0.0487 ns |   0.0407 ns |  0.48 |
|               |               |               |         |               |             |             |       |
|       ToArray |    DefaultJob |      .NET 8.0 |  100000 | 46,401.337 ns | 636.9110 ns | 595.7669 ns |  1.00 |
|     ToArrayPR |    DefaultJob |      .NET 8.0 |  100000 | 23,084.711 ns | 350.5279 ns | 327.8841 ns |  0.50 |
|               |               |               |         |               |             |             |       |
|       ToArray | NativeAOT 8.0 | NativeAOT 8.0 |  100000 | 63,969.073 ns | 182.6120 ns | 170.8154 ns |  1.00 |
|     ToArrayPR | NativeAOT 8.0 | NativeAOT 8.0 |  100000 | 40,710.868 ns | 148.9555 ns | 124.3846 ns |  0.64 |

xtqqczze · 2023-02-24T21:17:52Z

Replacing GC.AllocateUninitializedArray with simple new int[] fixes regression on small inputs.

I'm surprised this is measurable, since AllocateUninitializedArray<T> already uses new[] for small arrays:

runtime/src/coreclr/System.Private.CoreLib/src/System/GC.CoreCLR.cs

Lines 658 to 664 in 7829252

    
                           // small arrays are allocated using `new[]` as that is generally faster. 
        
           #pragma warning disable 8500 // sizeof of managed types 
        
                           if (length < 2048 / sizeof(T)) 
        
           #pragma warning restore 8500 
        
                           { 
        
                               return new T[length]; 
        
                           }

xtqqczze · 2023-02-24T21:31:17Z

Apple M1 Pro, 1 CPU, 8 logical and 8 physical cores

So the best case ratio is 0.47, presumably with much larger code size.

jeffhandley · 2023-03-27T18:58:41Z

@krwq What do you think is the right path forward for this PR?

neon-sunset · 2023-03-27T21:06:45Z

@jeffhandley I will take another go at the change to make it more similar to how SpanHelpers.IndexOf* is implemented to get rid of the regression on small lengths (M1 hides it completely but Zen 3 does not). Currently, on win-x64, it regresses by 20% (plus extra 1-2ns) at ranges like 5 but improves by about 3.6x at lengths 1000+.

xtqqczze · 2023-03-30T13:18:57Z

@neon-sunset Might be worth a read: #84115.

ghost · 2023-04-18T21:02:20Z

This pull request has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

ghost · 2023-05-03T00:02:03Z

This pull request will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the pull request, but please note that it will be locked if it remains inactive for another 30 days.

ghost added community-contribution Indicates that the PR has been added by a community member area-System.Linq labels Jan 13, 2023

EgorBo reviewed Jan 14, 2023

View reviewed changes

Comment thread src/libraries/System.Linq/src/System/Linq/Range.SpeedOpt.cs Outdated

neon-sunset mentioned this pull request Jan 17, 2023

[API Proposal]: List<T>.Create<TState>(int, TState, SpanAction<T, TState>) #80756

Closed

neon-sunset marked this pull request as ready for review January 18, 2023 00:18

stephentoub reviewed Jan 20, 2023

View reviewed changes

xtqqczze added a commit to xtqqczze/dotnet-runtime that referenced this pull request Jan 22, 2023

Remove RangeIterator.ToArray bounds check

40bb7db

dotnet#80633 (comment)

xtqqczze mentioned this pull request Jan 22, 2023

Eliminate ToArray bounds checks #81001

Merged

stephentoub pushed a commit that referenced this pull request Jan 23, 2023

Eliminate ToArray bounds checks (#81001)

a272954

* Remove `RangeIterator.ToArray` bounds check #80633 (comment) * Eliminate additional `ToArray` bounds checks

mdh1418 pushed a commit to mdh1418/runtime that referenced this pull request Jan 24, 2023

Eliminate ToArray bounds checks (dotnet#81001)

77aeaac

* Remove `RangeIterator.ToArray` bounds check dotnet#80633 (comment) * Eliminate additional `ToArray` bounds checks

jeffhandley added the needs-author-action An issue or pull request that requires more info or actions from the author. label Feb 14, 2023

ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Feb 15, 2023

neon-sunset closed this Feb 24, 2023

neon-sunset reopened this Feb 24, 2023

neon-sunset changed the title ~~Vectorize Enumerable.Range(...).ToArray()~~ [WIP] Vectorize Enumerable.Range(...).ToArray() Feb 24, 2023

Vectorize Enumerable.Range(...).ToArray()

cfc64ea

neon-sunset added 4 commits February 24, 2023 02:54

Fix impl. and address feedback

eda1c69

Reduce cost for short ranges

5d728e4

Remove unused reference

f7b6f3f

Try to fix tests

4ca8590

neon-sunset force-pushed the vectorize-range-initialization branch from 296f2bc to 4ca8590 Compare February 24, 2023 00:55

Address feedback

5d7f003

build-analysis Bot mentioned this pull request Feb 24, 2023

Roslyn source generator crash on mono/linux/arm64 #81123

Closed

neon-sunset requested review from EgorBo and stephentoub and removed request for EgorBo February 24, 2023 12:08

neon-sunset changed the title ~~[WIP] Vectorize Enumerable.Range(...).ToArray()~~ Vectorize Enumerable.Range(...).ToArray() Feb 24, 2023

build-analysis Bot mentioned this pull request Feb 24, 2023

CI legs failing with "Assembly 'illink' is missing dependency 'Mono.Cecil'" error #82611

Closed

jeffhandley added the needs-author-action An issue or pull request that requires more info or actions from the author. label Mar 27, 2023

eiriktsarpalis assigned krwq Apr 3, 2023

ghost added the no-recent-activity label Apr 18, 2023

ghost closed this May 3, 2023

ghost locked as resolved and limited conversation to collaborators Jun 2, 2023

This pull request was closed.

Conversation

neon-sunset commented Jan 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ghost commented Jan 13, 2023

Uh oh!

neon-sunset commented Jan 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stephentoub commented Jan 13, 2023

Uh oh!

neon-sunset commented Jan 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

xtqqczze commented Jan 16, 2023

Uh oh!

stephentoub commented Jan 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

neon-sunset commented Jan 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stephentoub commented Jan 17, 2023

Uh oh!

neon-sunset commented Jan 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stephentoub Jan 20, 2023

Choose a reason for hiding this comment

Uh oh!

stephentoub Jan 20, 2023

Choose a reason for hiding this comment

Uh oh!

xtqqczze commented Jan 24, 2023

Uh oh!

jeffhandley commented Feb 14, 2023

Uh oh!

neon-sunset commented Feb 15, 2023

Uh oh!

neon-sunset commented Feb 24, 2023

Uh oh!

xtqqczze commented Feb 24, 2023

Uh oh!

xtqqczze commented Feb 24, 2023

Uh oh!

jeffhandley commented Mar 27, 2023

Uh oh!

neon-sunset commented Mar 27, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xtqqczze commented Mar 30, 2023

Uh oh!

ghost commented Apr 18, 2023

Uh oh!

ghost commented May 3, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

neon-sunset commented Jan 13, 2023 •

edited

Loading

neon-sunset commented Jan 13, 2023 •

edited

Loading

neon-sunset commented Jan 13, 2023 •

edited

Loading

stephentoub commented Jan 16, 2023 •

edited

Loading

neon-sunset commented Jan 17, 2023 •

edited

Loading

neon-sunset commented Jan 17, 2023 •

edited

Loading

neon-sunset commented Mar 27, 2023 •

edited

Loading